Dynamically constrained, forward scheduling over uncertain workloads

ABSTRACT

Scheduling searchable items such as web pages for crawling involves dynamically scheduling items for downloading based on capacity based on time. The workload is distributed over time, in advance, by anticipating and accounting for the discovery of new links on the particular host. Respective times to download items can be determined based on the current size of the host's crawl corpus relative to the maximum size of the host's crawl corpus. The respective times may be determined based additionally on respective freshness targets for the searchable items, which characterize how often an item's content should be refreshed by re-downloading the item, and on respective politeness factors for the host, which characterize the delay time between consecutive download requests to that host. As such, one can know precisely how the system is performing at any point in time and predict future performance.

CROSS REFERENCE TO RELATED CASES

This application is a continuation of U.S. patent application Ser. No. 11/642,176 filed Dec. 19, 2006 which is incorporated herein by reference as if fully set forth herein, under 35 U.S.C. §120.

FIELD OF THE INVENTION

The present invention relates generally to workload management and, more specifically, to dynamically scheduling use of a resource in the context of an uncertain future workload.

BACKGROUND OF THE INVENTION

Search Engines

Through the use of the Internet and the World Wide Web (“the web”), individuals have access to billions of items of information. For example, the web provides access to items such as web pages, pictures, songs, videos, bookmark sets, white page listings, people, etc., generally and collectively referred to herein as “searchable items” or simply “items.” However, a significant drawback with using the web is that, because there is so little organization to the web, at times it can be extremely difficult for users to locate the particular items that contain the information that is of interest to them. To address this problem, a mechanism known as a “search engine” has been developed to index a large number of searchable items and to provide an interface that can be used to search the indexed information by entering certain words or phrases to be queried. These search terms are often referred to as “keywords”. A search engine is a computer program designed to find searchable items stored in a computer system, such as the web or such as a user's desktop computer. The search engine's tasks typically include finding searchable items, analyzing such items, and building a search index that supports efficient retrieval of such items.

Indexes used by search engines are conceptually similar to the normal indexes that are typically found at the end of a book, in that both kinds of indexes comprise an ordered list of information accompanied with the location of the information. An “index word set” of a document is the set of words that are mapped to the document, in an index. For example, an index word set of a web page is the set of words that are mapped to the web page, in a search index. For items that are not indexed, the index word set is empty.

Although there are many popular Internet search engines, they are generally constructed using the same three common parts. First, each search engine has at least one, but typically more, “web crawler” (also referred to as “crawler”, “spider”, “robot”) that “crawls” across the Internet in a methodical and automated manner to locate searchable items of information from around the world. Upon locating an item, the crawler stores the item's URL, and follows any hyperlinks associated with the item to locate other items. Second, each search engine contains information extraction and indexing mechanisms that extract and index certain information about the items that were located by the crawler. In the context of a web page, for example, index information is generated based on the contents of the HTML file associated with the web page. The indexing mechanism stores the index information in large databases that can typically hold an enormous amount of information. Third, each search engine provides a search tool that allows users, through a user interface, to search the databases in order to locate specific searchable items that contain information that is of interest to them, and their location on the web (e.g., a URL).

The search engine interface allows users to specify their search criteria (e.g., keywords) and, after performing a search, provides an interface for displaying the search results. Typically, the search engine orders the search results prior to presenting the search results to the user. The order usually takes the form of a “ranking”, where the searchable item with the highest ranking is the item considered most likely to satisfy the interest reflected in the search criteria specified by the user. Once the matching searchable items have been determined, and the display order of those items has been determined, the search engine sends to the user that issued the search a “search results page” that presents information (e.g., URLs, titles, summaries, etc.) about the matching searchable items in the determined display order.

Shared Crawler Resources

Sharing a limited resource among multiple users, in an environment where resource availability is unknown and time of exploitation is the most important constraint, presents its challenges. A typical example of this problem can be found in the context of Internet content acquisition (e.g., web crawling) where each “user” is the uniform resource identifier of an instance of web content. Each content acquisition cycle takes a varying amount of time, which depends on external and unknown factors such as network latency and host performance as well as local constraints such as central processing unit cycles and random access memory availability. While traditionally this problem has been approached in the context of web crawlers by dividing the overall corpus into smaller and smaller corpora, for crawling each corpus using a respective set of unshared resources, and by expanding system resources in response to the corresponding corpus expanding, such approaches offer very little control over resource capacity and timing issues.

Web crawlers traditionally “gorge” themselves on newly discovered links by filling a download queue uncontrollably by simply placing new links at the end of the queue. Some crawlers may implement download priorities by sorting the links from the queue as the links are output from the queue. Hence, because the download scheduling process is substantially uncontrolled and every crawler system has limited resources, some of the links may never be crawled because they keep getting pushed to the bottom of the queue.

In order to be effective, use of shared resources in all contexts requires some form of management of and control over access to such resources. In the context of web crawlers, shared resources may be managed across the entire corpus of searchable items by creating multiple system clusters with different policies associated with each cluster of machines. For example, each crawler system is configured to crawl items having the same or similar refresh rates. However, such a system of systems is likely to be difficult to configure, expensive to maintain, and an inefficient use of resources. Another possible approach to achieving relatively quicker refresh of a subset of items is to simply associate the subset with a fixed scheduling priority and, therefore, reload an associated input file every X minutes as dictated in the priority policy. However, such a system is likely to be difficult to administer and inflexible in its approach to scheduling policies.

A Vertical Portal (also referred to as a “vortal” or simply as a “vertical”) is a portal website that provides information and resources for a particular industry or topic. Verticals are the Internet's way of catering to consumers' focused-environment preferences, where verticals typically provide news, research and statistics, discussions, newsletters, online tools, and many other services that educate users about a specific industry or topic. Constructing a vertical requires topical crawling of the web in order to identify relevant content for a given vertical's topic, referred to as “vertical search”. Vertical searches require fine-grained control over short-lived important content, such as link hubs, that must be re-acquired more frequently than non-hub content. In the context of vertical search, where the content acquisition process requires more precise control over the timing of each acquisition, the non-capacity controlled approach does not work well enough.

Any approaches that may be described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements and in which:

FIG. 1 is a block diagram that illustrates a functional operating environment, according to an embodiment of the invention;

FIG. 2A is a diagram that illustrates an example prediction of future workload of a web crawler in the form of a graph depicting the number of queued pages in the next 24 hours, according to an embodiment of the invention;

FIG. 2B is a diagram that illustrates an example prediction of future workload of a web crawler in the form of a graph depicting the number of eligible pages for download per day over a span of days, according to an embodiment of the invention;

FIG. 3 is a diagram that illustrates a screenshot of a real-time interface to a crawler's per-host queues, depicting an example of what the crawler is doing at a particular point in time, according to an embodiment of the invention;

FIG. 4 is a flow diagram that illustrates a method for scheduling a searchable item for crawling, according to an embodiment of the invention;

FIG. 5 is a flow diagram that illustrates a method for managing a web crawler system, according to an embodiment of the invention; and

FIG. 6 is a block diagram that illustrates a computer system upon which an embodiment of the invention may be implemented.

DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION

In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring the present invention.

DEFINITIONS

The following definitions apply to terminology used herein.

A “web crawl” refers to the process performed by a web crawler, by which the web crawler system “crawls” across the Internet to locate and download searchable items of information (for a non-limiting example, web pages) and, upon locating such an item, stores the item's URL and hyperlinks contained within the item to locate other items.

A “freshness target” for a searchable item refers to how often a searchable item's content should be refreshed by downloading the item from the web, i.e., how “fresh” the item should be, for example, for purposes of a search engine.

“Politeness” refers to the concept of prohibiting web crawlers from issuing consecutive download requests to a particular host faster than a certain rate. Politeness is enacted, generally, to avoid inundating a host with crawl-related requests.

Functional Overview of Embodiments

Techniques for scheduling searchable items for crawling involve detecting a hyperlink to a searchable item on a particular host, while performing a crawl. As discussed, crawling the web, or topical portions of the web in the case of vertical search, involves locating and downloading searchable items and recording hyperlinks associated with the item. Once “new” links are discovered on a web page, for example, the searchable items located via the links need to be scheduled for downloading. Thus, rather than downloading all the new items right away after discovery, and rather than simply placing the new links at the end of a crawler download queue, the new items are dynamically scheduled for downloading based on, conceptually, capacity based on time. The workload is distributed over time, in advance, by anticipating and accounting for the discovery of new links on the particular host.

Hence, according to an embodiment, respective times to download searchable items are determined based in part on the number of searchable items on that host that are currently scheduled for downloading (i.e., the “current size” of the host's crawl corpus), relative to the maximum number of searchable items on that host that are to be scheduled for downloading (i.e., the “maximum size” of the host's crawl corpus). The respective times may be determined based additionally on respective freshness targets for the searchable items, where a freshness target characterizes how often a searchable item's content should be refreshed by downloading the item from the web, and on respective politeness factors for the host, where the politeness factor characterizes the minimum time between consecutive download requests to that host. Thus, a crawler download queue is constrained to a certain number of items for a given host and a running count of the number of scheduled items on that host is maintained, in order to schedule for download a given item on that host in view of the freshness target for that item and the politeness factor for that host.

Consequently, searchable items from a given host are not necessarily scheduled for download in the order that they are discovered, but are substantially evenly distributed over a timeline, perhaps in conjunction with items from other hosts, and in a manner that attempts to comply with host-related and item-related constraints. As such, one can know precisely how the system is performing at any point in time and predict future performance. If it is predicted, based on the schedule, that the host-related and item-related constraints cannot be met using a certain set of resources, e.g., a particular system of machine(s) crawling a given corpus, then associated advisories can be issued to indicate that the system has insufficient processing capacity. This dramatically reduces support costs for the system because operators are able to quickly and efficiently detect issues with capacity or otherwise.

In the context of vertical searches, time is often the most important constraint because prices, fares, job listings, etc. (depending on the vertical) change daily, or even more frequently. Thus, vertical searchers typically prefer timing over comprehensiveness. Therefore, the techniques described herein are particularly suitable for and beneficial to vertical searches, although not limited to such.

Functional Operating Environment

Most web crawler administrators want to be able to configure crawlers by specifying a set of constraints on the crawler's behavior. Thus, crawlers can be configured to assign each searchable item to a scheduling class. Each scheduling class specifies the scheduling-related constraints for the searchable items assigned to that class, such as the freshness target for that class. Further, crawlers can be configured to assign politeness factors to hosts, as well as to assign a maximum number of items downloaded per host. Therefore, such constraints can be used to dynamically schedule for downloading searchable items from one or more hosts, in view of a crawler system's resource capacity based on time.

FIG. 1 is a block diagram that illustrates a functional operating environment, according to an embodiment of the invention. FIG. 1 depicts a web crawler 102, comprising a downloader 104 and a scheduler 106, connected to Internet 108, which is connected to hosts 110, 111, and n hosting searchable items 110 a, 110 b-110 n and 111 a, 111 b-111 n. FIG. 1 also depicts host-based constraints 112, item-based constraints 114, and a queue 116.

Web crawler 102 represents a web crawler which, as described herein, is a computer software system that “crawls” across the Internet to locate and download searchable items and, upon locating such an item, stores the item's URL and hyperlinks contained within the item to locate other items. Web crawler 102 is communicatively connected to a network 108, such as the Internet. Host 110, host 111, and host n are also communicatively connected to the network 108, such that web crawler 102 can request searchable items from each of the hosts. The number of hosts connected to the network 108 varies and, therefore, the hosts are depicted as host 110, host 111, and host n for convenience only. Thus, the number of hosts connected to the network 108 is not limited to three hosts or to n hosts or to any number of hosts. Host 110 is depicted as hosting searchable items 110 a, 110 b-110 n and host 111 is depicted as hosting searchable items 111 a, 111 b-111 n. Similarly, the number of searchable items hosted by hosts 110 and 111 varies and, therefore, the searchable items are depicted as they are in FIG. 1 for convenience. Thus, the number of searchable items hosted by host 110 and host 111 is not limited to three items or to n items or to any number of items. The searchable items depicted in FIG. 1 generally represent the corpus of searchable items that may be crawled by web crawler 102.

Web crawler 102 comprises a downloader 104 and a scheduler 106. Downloader 104 pulls download requests for searchable items (e.g., in the form of URIs, Uniform Resource Identifiers) from a crawler queue 116 in an order generated by scheduler 106, and uses the URIs to download corresponding searchable items via a network 108. According to one embodiment, the URIs in crawler queue 116 are segregated by host and placed in per-host queues. Thus, in such an embodiment, each thread of a multi-thread downloader 104 works on and completes one per-host queue at a time before moving to another per-host queue. However, while one thread is working on one per-host queue, other threads of downloader 104 work in parallel on other per-host queues. Consequently, while one thread may be waiting for a period of time to download another item from a particular host, i.e., due to the host politeness factor, other threads are still able to download items from other hosts and the crawler system as a whole is not idle.
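The segregation of crawler queue 116 into per-host queues can be illustrated with a minimal Python sketch. The function name `segregate_by_host` and the use of the URI authority component as the host key are assumptions made for illustration only; they are not taken from the described embodiment.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit


def segregate_by_host(uris):
    """Group URIs into per-host FIFO queues, preserving the incoming order.

    Hypothetical helper: the host key is the URI authority (netloc),
    which includes the port when one is given.
    """
    queues = defaultdict(deque)
    for uri in uris:
        queues[urlsplit(uri).netloc].append(uri)
    return queues


# Each downloader thread could then drain one per-host queue at a time,
# while other threads work on the queues of other hosts.
queues = segregate_by_host([
    "http://a.example/1",
    "http://b.example/x",
    "http://a.example/2",
])
```

Keeping one queue per host makes the politeness wait local to a single queue, so a thread idling on one host does not block downloads from the others.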

Scheduler 106

After downloader 104 downloads searchable items, each item's embedded outbound links are extracted from the item content. Searchable items corresponding to the new links and the item just downloaded are scheduled, by the scheduler 106, to be downloaded at a future time. The shared resource that the scheduler 106 manages is the set of “download slots” available for all the hosts that a particular web crawler 102 system intends to crawl, in an attempt to maximize overall throughput while complying with and adhering to specified scheduling constraints, such as the politeness factor and the maximum number of downloads associated with each host and the freshness target associated with each searchable item. The resource being scheduled by scheduler 106 can be considered to have the following properties. Once a download slot is allocated to a URI, the slot can be held for an unknown time, where the download can take seconds or even minutes before completing. Meanwhile, no other URIs for that host can be downloaded because of politeness. Once a download is performed, scheduler 106 should wait for the politeness interval before the per-host slot becomes available again.
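The slot properties just described can be sketched as follows. This is a minimal illustration that assumes a single download slot per host and times expressed in seconds; the class name `HostSlot` and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HostSlot:
    """One per-host download slot (hypothetical model, times in seconds)."""
    politeness: float          # minimum wait after a download completes
    busy_until: float = 0.0    # time at which the slot is next available

    def next_available(self, now: float) -> float:
        """Earliest time at or after `now` when the slot may be used."""
        return max(now, self.busy_until)

    def record_download(self, start: float, duration: float) -> None:
        """Hold the slot for the download, then for the politeness interval."""
        self.busy_until = start + duration + self.politeness


slot = HostSlot(politeness=2.0)
slot.record_download(start=10.0, duration=3.5)
```

Because the download duration is unknown until the download completes, a scheduler working ahead of time would have to substitute an estimate for `duration` when planning future slots.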

Hence, scheduler 106 implements host-based constraints 112 and item-based constraints 114 in a scheduling process, which determines where to place each item in download queue 116. Where each searchable item is placed in queue 116 reflects a relative time at which that searchable item is to be downloaded from the corresponding host 110, host 111-host n by downloader 104. How the set of searchable items from a given host are ordered in queue 116 reflects an order in which the searchable items are downloaded by a thread of downloader 104. Host-based constraints 112 include, but are not limited to, the politeness factor and the maximum number of downloads associated with each host. Item-based constraints 114 include, but are not limited to, the freshness target associated with each searchable item. Note that different searchable items hosted by a given host may be associated with different freshness targets. For example, in response to downloading a searchable item, a classifier (e.g., a machine-learned semantic classifier process) may classify the relative quality and importance of the item's content, perhaps with consideration to a particular topic in the case of a vertical search. Based on the resulting classification, a freshness target is generated for the downloaded item, which reflects the rate at which the web crawler 102 should refresh the item based at least in part on downloader 104 again downloading the item from its corresponding host. With newly discovered links whose corresponding searchable items have yet to be downloaded, a classifier may generate corresponding freshness targets based on the links' respective parent items, i.e., the freshness target for the searchable item in which the link is discovered. Further, a classifier may infer information from the outbound link itself, such as the link's anchor text, in order to generate a corresponding freshness target for the link.

According to an embodiment, for a newly discovered URI link, the scheduler 106 uses the current number of URIs scheduled for this host and the freshness target for this URI's scheduling class, to determine when a host download slot will be available and not in use by another URI for this host. Hence, scheduler 106 schedules the searchable item corresponding to this URI to be downloaded no earlier than the time at which the host download slot becomes available. According to one embodiment, for a newly discovered link, the scheduler 106 uses the politeness factor for this host to determine when a host download slot will be available for use. In this context, a download slot is available for use when its politeness delay has expired and it is not scheduled for use by another URI for this host.

According to an embodiment, scheduler 106 schedules the searchable item corresponding to this URI to be downloaded at the time at which the host download slot becomes available. However, the time at which a given searchable item is scheduled for downloading could change, for example, in response to another new URI link being discovered that belongs to a different scheduling class having a shorter freshness target than the given searchable item. For example, a first searchable item, which is associated with a “1 day” freshness target, may be scheduled for download at the earliest available download slot for that host, e.g., in 11.5 hours. Subsequently, a second URI from the same host is discovered, which belongs to a different class having a “12 hour” freshness target. Thus, based in part on its more demanding freshness target, the item corresponding to the second URI may be scheduled for download at the earliest available download slot for that host, i.e., using the slot that was occupied by the first item and thereby pushing the first item back one or more slots.
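The rescheduling example above can be sketched in Python. The sketch assumes, purely for illustration, that a host's items are assigned to that host's upcoming slots in order of increasing freshness target; the function name `assign_slots` and the second slot time of 11.6 hours are hypothetical.

```python
def assign_slots(items, slot_times):
    """Assign items to a host's upcoming slots, tightest freshness target first.

    items: list of (uri, freshness_target_hours) in discovery order.
    slot_times: ascending times (hours from now) of the host's free slots.
    Ties keep discovery order because Python's sort is stable.
    """
    ordered = sorted(items, key=lambda item: item[1])
    return {uri: t for (uri, _), t in zip(ordered, slot_times)}


slots = [11.5, 11.6]
before = assign_slots([("first", 24)], slots)
after = assign_slots([("first", 24), ("second", 12)], slots)
```

Here the first item initially holds the 11.5 hour slot. Once the second item with the 12 hour freshness target is discovered, it takes that slot and the first item is pushed back to the next one.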

According to an embodiment, for newly discovered URI links, the scheduler 106 also takes into account the maximum number of searchable items from a given host that are to be scheduled for downloading, which could be a user-configurable parameter, to determine when to schedule a new item for downloading. That is, the scheduler 106 takes into account the number of searchable items from that host that are already scheduled in proportion to the maximum number of searchable items from that host that are to be scheduled, i.e., how far along is the scheduler 106 in the scheduling process for a given host and, therefore, how many more slots may be needed for that host.

Because scheduler 106 utilizes an ongoing, dynamic scheduling process for a dynamic and uncertain workload, the scheduled download slots for a host's items may be continuously changing as the uncertainty decreases. The uncertainty may be decreased, for example, based on the knowledge imparted through discovering new links for that host as well as the number of searchable items from that host that are already scheduled in proportion to the maximum number of searchable items from that host that are to be scheduled. The result is that, over the time spent crawling a given host, searchable items for that host are distributed substantially uniformly over future time, while accounting for the various host-based constraints 112 and item-based constraints 114 that apply to each item and each item's host.

According to an embodiment, for newly discovered URI links, the scheduler 106 also uses an estimate of the corresponding host's processing speed to determine when a host download slot will be available. The host's processing capabilities can be used to estimate, on average, how long it might take to download an item from that host. Further, metrics about how long it has actually been taking to download items from a given host can be tracked, to maintain a running average which accounts for the variation in download times among items of different sizes from the same host, as well as other factors.
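One simple way to maintain such a running average without storing every observation is an incremental mean, sketched below. The class name `RunningAverage` is hypothetical, and the choice of a plain cumulative mean (rather than, say, a weighted or windowed average) is an assumption; the description above does not specify which kind of average is used.

```python
class RunningAverage:
    """Incrementally updated mean of observed download times (seconds)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        """Fold one new observation into the mean and return the new mean."""
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


avg = RunningAverage()
for seconds in (2.0, 4.0, 6.0):
    avg.update(seconds)
```

The resulting per-host mean could serve as the expected download duration when estimating when that host's download slot will next be free.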

According to an embodiment, for URI links corresponding to searchable items that have already been downloaded, the scheduler uses the freshness target corresponding to that item to determine when the item should be refreshed, e.g., downloaded again for information extraction, classification, and keyword indexing purposes. According to embodiments, one or more other factors are taken into account in scheduling old links to be refreshed, such as other host properties (or estimates of them), including one or more of the host speed, the host maximum size, and the host politeness factor.
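As a rough illustration of refresh scheduling, the sketch below computes a refresh time from the completion time of the last download and the item's freshness target, and reports whether the target can still be met given when the host's slot is next free. The function name, its parameters, and the rule of scheduling exactly at the due time are assumptions for illustration, not details of the described embodiment.

```python
def schedule_refresh(completed_at, freshness_target, slot_available_at):
    """Return (scheduled_time, meets_target) for re-downloading an item.

    All arguments share one time unit (seconds here). Hypothetical rule:
    refresh when the freshness target elapses, but never before the
    host's download slot is free.
    """
    due = completed_at + freshness_target
    scheduled = max(due, slot_available_at)
    return scheduled, scheduled <= due


on_time = schedule_refresh(100.0, 3600.0, 200.0)
late = schedule_refresh(100.0, 3600.0, 5000.0)
```

A `False` second element indicates a freshness target that cannot be met with the current resources, which is the kind of condition a capacity advisory could be raised on.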

The scheduling sub-system described herein (i.e., scheduler 106 operating with consideration to one or more host-based constraints 112 and/or item-based constraints 114) allows web crawler 102 to be, for non-limiting examples:

(1) Dynamically Constrained, where the scheduling requirements and constraints can be dynamically set on a running system (e.g., web pages' refresh rates changing in response to user-reconfiguration, or refresh rates computed per-page by a computer program);

(2) Forward Scheduled, where the scheduling decision as to when to refresh a searchable item is made immediately after the item is downloaded, so the behavior of the system with respect to when future refreshes will be attempted can be predicted well in advance;

(3) Operational with Ad-hoc Workloads, where the scheduling decision is made without knowing the size of the workload in the future, e.g., while not knowing precisely how many searchable items are contained within the host, the host speed, or even how long the download of a particular item might take; and

(4) Featured with Advanced Reporting and Prediction, where, because the crawler is forward scheduled, future crawler behavior can be predicted in advance, thereby enabling advanced reporting capabilities. Examples of advanced reporting and prediction features are illustrated in FIGS. 2A and 2B. FIG. 2A is a diagram that illustrates an example prediction of future workload of a web crawler in the form of a graph depicting the number of queued pages in the next 24 hours, according to an embodiment of the invention. FIG. 2B is a diagram that illustrates an example prediction of future workload of a web crawler in the form of a graph depicting the number of eligible pages for download per day over a span of days, according to an embodiment of the invention.

According to an embodiment, advanced reporting capabilities are provided through a real-time interface to the crawler's per-host queues, which displays dynamic queue sizes, host URI counts, and estimates of the current speed of each host and/or estimated download speeds for items from the host. FIG. 3 is a diagram that illustrates a screenshot of a real-time interface to a crawler's per-host queues, depicting an example of what the crawler is doing at a particular point in time, according to an embodiment of the invention. Together, the various reports illustrated in FIG. 2A, FIG. 2B, and FIG. 3 comprise a very complete view of what the crawler is working on right now as well as what the crawler will be working on in the future. This detailed information can also be used to predict when the system is reaching capacity, or is over capacity, and will not be able to meet freshness targets because of resource limitations.
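Because download times are assigned in advance, a workload forecast of the kind shown in FIG. 2A can be derived directly from the schedule. The following sketch buckets scheduled download times by hour and flags hours that exceed a capacity limit. The function names and the notion of a fixed per-hour capacity are assumptions for illustration.

```python
from collections import Counter


def hourly_load(scheduled_times, now, horizon_hours=24):
    """Count scheduled downloads per hour over the next `horizon_hours`.

    scheduled_times and now are in seconds; items scheduled before `now`
    or beyond the horizon are ignored.
    """
    counts = Counter()
    for t in scheduled_times:
        hour = int((t - now) // 3600)
        if 0 <= hour < horizon_hours:
            counts[hour] += 1
    return [counts[h] for h in range(horizon_hours)]


def over_capacity_hours(load, capacity_per_hour):
    """Return the hour offsets whose predicted load exceeds capacity."""
    return [h for h, n in enumerate(load) if n > capacity_per_hour]


load = hourly_load([0, 10, 3600, 7199, 7200, 90000], now=0, horizon_hours=3)
overloaded = over_capacity_hours(load, capacity_per_hour=1)
```

A non-empty `overloaded` list is the sort of signal on which an advisory about insufficient processing capacity could be based.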

The various techniques described herein result in better overall web crawler system throughput, higher effective capacity in the ability to satisfy more restrictive constraints on the same hardware, and more control over the aggregate web crawler behavior for a given corpus. Dynamically assignable scheduling classes, with associated freshness targets, enable a new class of web crawler applications that are capable of fine-grained control over the refresh rates of specific web pages and other searchable items. Consequently, a web crawler (e.g., web crawler 102 of FIG. 1) can perform prioritized crawling of sites so that certain “hot” pages are refreshed far more often compared to “cold” pages. For example, video search implementations could crawl video hub pages, which have frequently changing content, much faster compared to pages that host the videos themselves. For another example, assuming one knows that a particular page results in a pop-up window using a pop-up window detector, the web crawler 102 could correspondingly change the scheduling priority of that page.

A Method for Scheduling Searchable Items for Crawling

FIG. 4 is a flow diagram that illustrates a method for scheduling a searchable item for crawling, according to an embodiment of the invention. The method depicted in FIG. 4 is a computer and/or machine-implemented method in which a computer or machine performs the method, such as by one or more processors executing instructions. For example, the method may be performed on or by a computer system such as computer system 600 of FIG. 6.

At block 402, while performing a web crawl, a hyperlink is detected to a first searchable item on a particular host. As discussed, web crawling involves locating and downloading searchable items and recording hyperlinks contained within the downloaded items so that the searchable items located via the hyperlinks can be crawled. For example, downloader 104 (FIG. 1) downloads searchable item 110 a (FIG. 1) from host 110 (FIG. 1) via network 108 (FIG. 1), and finds a newly discovered link to searchable item 110 b contained in searchable item 110 a.

At block 404, in response to detecting the hyperlink, a time at which to download (referred to as the “download time”) the first searchable item is determined, i.e., the first searchable item is scheduled for download. The download time is determined based at least in part on a number of searchable items on the particular host that are already scheduled for download (referred to as the “current size”), and a freshness target for the first searchable item. Continuing with the example, scheduler 106 (FIG. 1) computes a download time for searchable item 110 b (FIG. 1) based on the current size of the corpus for host 110 (FIG. 1) and on the freshness target for searchable item 110 b from item-based constraints 114 (FIG. 1), and places searchable item 110 b appropriately along the timeline of crawler download queue 116.

Let R refer to how far into the future a searchable item is scheduled for download, i.e., the higher R is for a given item, the further into the future the item is scheduled for download. According to an embodiment, the larger the freshness target (specified in some units of time) for a given item, the higher R is for that item. Generally, this means that a given item with a relatively higher freshness target than another item can be safely pushed into the future farther than the other item, which needs to be downloaded sooner than the given item to meet the other item's freshness target. According to an embodiment, the larger the current size for a given item's host, the higher R is for that item. This means that a given item whose link is detected after the link for another item from the same host can, independent of the items' respective freshness targets, be safely pushed into the future farther than the other item. This is because, generally, the countdown to the later-discovered item's freshness target begins later than the countdown for the earlier-discovered item's freshness target. In addition, as the host's current size increases, more of the corpus is known, and there are fewer additional items from the same host to anticipate and plan for in the scheduling process. Thus, a given item whose link is detected on a given host at time t₁ before detection of the link for another item from a different host at time t₂, where the current size for the given item's host is greater than for the other item's host, can be safely pushed into the future farther than the other item even though the given item was detected before the other item.

According to an embodiment, at block 404, the download time for the first searchable item is determined based at least in part on the current size relative to the maximum number of searchable items on the particular host that are to be scheduled for downloading (referred to as the “maximum size”), and the freshness target for the first searchable item. According to an embodiment, the larger the proportion of the current size to the maximum size, the higher R is. As above, this is also because of the freshness target countdown start time and the knowledge about the host corpus, where the knowledge about the host corpus now includes more concrete knowledge about how many additional searchable items may need to be scheduled before the maximum size is reached. Thus, a given item whose link is detected on a given host at time t₁ before detection of the link for another item from a different host at time t₂, where the proportion of current size to maximum size for the given item's host is greater than for the other item's host, can be safely pushed into the future farther than the other item even though the given item was detected before the other item. Dynamic consideration, by scheduler 106 (FIG. 1), of the ratio of the current size to the maximum size for a given searchable item's host, in conjunction with the freshness target for the given item, provides an optimizing scheduling function that complies with such constraints while performing a download schedule distribution that maximizes overall throughput of the crawler system over multiple hosts and multiple scheduling classes, as represented by corresponding freshness targets.
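
The description states only that R grows with the freshness target and with the ratio of current size to maximum size; it does not give a formula. The following Python sketch assumes the simplest function with both properties, namely the freshness target multiplied by the clamped ratio. The function name `schedule_offset` and this particular formula are assumptions made for illustration.

```python
def schedule_offset(freshness_target_s: float,
                    current_size: int,
                    max_size: int) -> float:
    """Return R, how far into the future (in seconds) to schedule an item.

    Assumed formula (not from the description): R = target * fill, where
    fill = current_size / max_size, clamped to at most 1. R therefore
    increases with both the freshness target and the fill ratio.
    """
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    if freshness_target_s < 0 or current_size < 0:
        raise ValueError("inputs must be non-negative")
    fill = min(current_size, max_size) / max_size
    return freshness_target_s * fill


# Same one-hour freshness target, two hosts at different fill levels:
# the item on the fuller host is pushed farther into the future.
r_sparse = schedule_offset(3600, current_size=20, max_size=100)
r_full = schedule_offset(3600, current_size=80, max_size=100)
```

Under this assumed formula, an item on a host whose corpus is mostly known can use nearly its whole freshness window, while an item on a sparsely known host is scheduled early, leaving room for links that have yet to be discovered.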

At block 406, the first searchable item is scheduled for download based on the download time determined at block 404. Continuing with the example, scheduler 106 (FIG. 1) places searchable item 110 b (FIG. 1) appropriately along the timeline of crawler download queue 116 (FIG. 1). According to an embodiment, a searchable item is queued for downloading by associating a timestamp with the item. The timestamp need not be an absolute real time; rather, the timestamp represents the item's relative position in the download queue 116, where sufficient margin is maintained between “consecutive” timestamps so that later-discovered items can be placed between consecutive timestamps if necessary, e.g., to “squeeze” a new link between two existing scheduled links.
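
The relative-timestamp queue can be sketched as follows. This is an illustrative Python fragment, not the implementation of crawler download queue 116: the class name `DownloadQueue`, the margin `GAP`, and the midpoint rule for squeezing in a new link are all assumptions.

```python
import bisect

GAP = 1024  # hypothetical margin left between consecutive timestamps


class DownloadQueue:
    """Items ordered by relative integer timestamps with spare room between."""

    def __init__(self) -> None:
        self._stamps: list[int] = []     # kept sorted
        self._items: dict[int, str] = {}  # timestamp -> URL

    def append(self, url: str) -> int:
        """Queue an item after everything already scheduled."""
        stamp = self._stamps[-1] + GAP if self._stamps else GAP
        self._stamps.append(stamp)
        self._items[stamp] = url
        return stamp

    def squeeze_after(self, earlier_stamp: int, url: str) -> int:
        """Place an item between earlier_stamp and its successor."""
        i = bisect.bisect_left(self._stamps, earlier_stamp)
        if i == len(self._stamps) or self._stamps[i] != earlier_stamp:
            raise KeyError(earlier_stamp)
        if i == len(self._stamps) - 1:
            return self.append(url)  # nothing follows; just append
        lo, hi = self._stamps[i], self._stamps[i + 1]
        if hi - lo < 2:
            raise RuntimeError("no margin left; timestamps need renumbering")
        stamp = (lo + hi) // 2  # midpoint keeps room on both sides
        self._stamps.insert(i + 1, stamp)
        self._items[stamp] = url
        return stamp

    def in_order(self) -> list[str]:
        return [self._items[s] for s in self._stamps]


queue = DownloadQueue()
first = queue.append("http://example.com/a")
queue.append("http://example.com/b")
queue.squeeze_after(first, "http://example.com/new")  # lands between a and b
```

Because only relative order matters, inserting a later-discovered link does not require touching the timestamps of items already in the queue until the margin between two neighbors is exhausted.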

With past approaches, it was not determinable with a reasonable degree of precision or accuracy when a particular item would be crawled, and it was not easy to adapt to time constraints. Based on the foregoing description, one non-limiting practical application of the method embodied in blocks 402-406 is that hyperlinks that are discovered during a web crawling process are, in real-time, scheduled for downloading at a future time by a web crawler 102 (FIG. 1). For example, the crawler download queue 116, which embodies the download schedule, is a useful, concrete, and tangible result of the method illustrated in FIG. 4, in that download requests are queued and eventually submitted to the respective hosts 110 a, 110 b-110 n (FIG. 1) by the downloader 104 (FIG. 1). From this web crawling process, a vast number of searchable items are indexed, whereby such items can be located by the public through use of an associated search engine for which the items are indexed. Furthermore, the future and current workload of the web crawler 102 can be displayed and viewed, for non-limiting examples, through the graphs illustrated in FIGS. 2A and 2B and the interface illustrated in FIG. 3.

According to an embodiment, the method depicted by blocks 402-406 is extended to include blocks 408-412 of FIG. 4. At block 408, the first searchable item is downloaded based on the time determined and scheduled at blocks 404-406. Continuing with the example, downloader 104 (FIG. 1) of web crawler 102 (FIG. 1) downloads searchable item 110 b from host 110 (FIG. 1) at a time based on its placement by scheduler 106 (FIG. 1) along the timeline of crawler download queue 116 (FIG. 1).

At block 410, a time to re-download the first searchable item is determined, based at least in part on the freshness target for the first searchable item. Of course, other factors may be considered in determining the re-download time for the item, such as the processing capabilities of the relevant host (e.g., CPU type, CPU speed, memory size, bus speed, etc.) and/or host-based constraints like the politeness factor corresponding to the relevant host or the current size or the maximum size for the host. At block 412, the first searchable item is scheduled for downloading based on the time to re-download determined at block 410. Continuing with the example, scheduler 106 (FIG. 1) again places searchable item 110 b (FIG. 1) appropriately along the timeline of crawler download queue 116 (FIG. 1), for re-downloading searchable item 110 b at a future time in order to refresh the item for the search engine, e.g., to extract and re-index the current content of searchable item 110 b.
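
One way blocks 410-412 might combine the freshness target with a politeness factor is sketched below in Python. The rule used here (refresh when the freshness target elapses, but never sooner than the politeness delay after the last request to the same host) is an assumption, as is the function name `next_redownload_time`; the description leaves the exact combination open.

```python
def next_redownload_time(downloaded_at: float,
                         freshness_target_s: float,
                         host_last_request_at: float,
                         politeness_delay_s: float) -> float:
    """Return the time (seconds, any consistent clock) of the next refresh.

    Assumed rule: the item is due one freshness target after its last
    download, but the request must not violate the host's politeness delay.
    """
    due = downloaded_at + freshness_target_s
    earliest_polite = host_last_request_at + politeness_delay_s
    return max(due, earliest_polite)


# Item fetched at t=100 with a one-hour target; the host was last contacted
# at t=120 and requires 2 seconds between requests.
when = next_redownload_time(100.0, 3600.0, 120.0, 2.0)
```

In this example the freshness target dominates; the politeness delay only matters when the freshness target is short relative to the host's request spacing.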

A Method for Managing a Web Crawler System

FIG. 5 is a flow diagram that illustrates a method for managing a web crawler system, according to an embodiment of the invention. The method depicted in FIG. 5 is a computer and/or machine-implemented method in which a computer or machine performs the method, such as by one or more processors executing instructions. For example, the method may be performed on or by a computer system such as computer system 600 of FIG. 6. The method of FIG. 5 generally reflects the forward-looking, predictive nature of the scheduling techniques described herein.

At block 502, while performing a web crawl using a particular web crawler system, detect hyperlinks to a plurality of searchable items hosted by a particular host. As discussed, web crawling involves locating and downloading searchable items and recording hyperlinks contained within the downloaded items so that the searchable items located via the hyperlinks can be crawled. For example, downloader 104 (FIG. 1) downloads searchable item 110 a (FIG. 1) from host 110 (FIG. 1) via network 108 (FIG. 1), and finds newly discovered links to searchable items 110 b-110 n contained in searchable item 110 a.

At block 504, read one or more freshness targets that correspond to the plurality of searchable items. As discussed, freshness targets specify how often corresponding searchable items should be crawled. A freshness target may be associated with an entire scheduling class of searchable items and may vary from item to item on a given host. Thus, freshness targets are considered item-based constraints, such as item-based constraints 114 (FIG. 1).

At block 506, read a politeness factor that corresponds to the particular host. As discussed, politeness factors specify a rate for submitting consecutive download requests to a particular host, i.e., how much time should pass between consecutive downloads from a given host. Politeness factors are considered host-based constraints, such as host-based constraints 112 (FIG. 1).

At block 508, based at least in part on the freshness targets and the politeness factor, determine whether the particular crawler system has enough processing capacity to crawl the plurality of searchable items in compliance with the freshness targets and the politeness factor. For example, based on forward-looking graphs such as those depicted in FIGS. 2A and 2B, where the crawler download schedule information depicted was generated by taking into account the freshness targets and the politeness factor, it is determinable whether the corresponding crawler system is computationally powerful enough to keep up with the depicted schedule.
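
The data behind a forward-looking graph of this kind can be sketched as a per-hour count of queued downloads. The Python fragment below is illustrative only; the function name `hourly_load` and the use of scheduled download times expressed in seconds are assumptions, and FIGS. 2A and 2B may be produced differently.

```python
from collections import Counter
from typing import Iterable, List


def hourly_load(download_times_s: Iterable[float],
                now_s: float,
                hours_ahead: int) -> List[int]:
    """Count scheduled downloads in each of the next hours_ahead hours."""
    counts: Counter = Counter()
    for t in download_times_s:
        hour = int((t - now_s) // 3600)  # 0 means within the next hour
        if 0 <= hour < hours_ahead:
            counts[hour] += 1
    return [counts[h] for h in range(hours_ahead)]


# Five scheduled downloads, viewed three hours ahead from t=0.
load = hourly_load([10, 3599, 3600, 8000, 20000], now_s=0, hours_ahead=3)
```

Comparing each hourly count against the number of downloads the crawler system can complete in an hour indicates whether the system can keep up with the depicted schedule.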

At block 510, if it was determined at block 508 that the crawler system does not have enough processing capacity to keep up with the download schedule required to comply with the freshness targets and the politeness factor, then generate a message indicating that the crawler system does not have enough capacity. The form which such an “under capacity” message takes may vary from implementation to implementation. For non-limiting examples, the message could be exposed and externalized through a user interface or could simply be a message that is internal to the crawler system to trigger an event or some remedial action.
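
Blocks 504-510 can be sketched as a rate comparison. The Python fragment below assumes that each item must be downloaded once per freshness target, that the politeness factor caps the request rate to the host at one request per delay period, and that the crawler system's capacity is known as a download rate. These modeling choices, the function name `under_capacity_message`, and the message text are assumptions and are not specified in the description.

```python
from typing import Optional, Sequence


def under_capacity_message(freshness_targets_s: Sequence[float],
                           politeness_delay_s: float,
                           crawler_rate_per_s: float) -> Optional[str]:
    """Return an "under capacity" message, or None if the schedule is feasible."""
    if politeness_delay_s <= 0 or any(f <= 0 for f in freshness_targets_s):
        raise ValueError("targets and politeness delay must be positive")
    # Downloads per second needed to refresh every item on time.
    required = sum(1.0 / f for f in freshness_targets_s)
    # Usable rate: limited by the crawler and by politeness toward the host.
    usable = min(crawler_rate_per_s, 1.0 / politeness_delay_s)
    if required > usable:
        return (f"under capacity: need {required:.3f} downloads/s, "
                f"can sustain {usable:.3f} downloads/s")
    return None


ok = under_capacity_message([100.0] * 10, politeness_delay_s=2.0,
                            crawler_rate_per_s=1.0)
alert = under_capacity_message([100.0] * 1000, politeness_delay_s=2.0,
                               crawler_rate_per_s=1.0)
```

The returned string could be shown in a user interface or used internally to trigger remedial action, matching the two non-limiting examples given above.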

Hardware Overview

FIG. 6 is a block diagram that illustrates a computer system 600 upon which an embodiment of the invention may be implemented. Computer system 600 includes a bus 602 or other communication mechanism for communicating information, and a processor 604 coupled with bus 602 for processing information. Computer system 600 also includes a main memory 606, such as a random access memory (RAM) or other dynamic storage device, coupled to bus 602 for storing information and instructions to be executed by processor 604. Main memory 606 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 604. Computer system 600 further includes a read only memory (ROM) 608 or other static storage device coupled to bus 602 for storing static information and instructions for processor 604. A storage device 610, such as a magnetic disk or optical disk, is provided and coupled to bus 602 for storing information and instructions.

Computer system 600 may be coupled via bus 602 to a display 612, such as a cathode ray tube (CRT) or liquid crystal display (LCD), for displaying information to a computer user. An input device 614, including alphanumeric and other keys, is coupled to bus 602 for communicating information and command selections to processor 604. Another type of user input device is cursor control 616, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 604 and for controlling cursor movement on display 612. This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allows the device to specify positions in a plane.

The invention is related to the use of computer system 600 for implementing the techniques described herein. According to an embodiment of the invention, those techniques are performed by computer system 600 in response to processor 604 executing one or more sequences of one or more instructions contained in main memory 606. Such instructions may be read into main memory 606 from another machine-readable medium, such as storage device 610. Execution of the sequences of instructions contained in main memory 606 causes processor 604 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions to implement the invention. Thus, embodiments of the invention are not limited to any specific combination of hardware circuitry and software.

The term “machine-readable medium” as used herein refers to any medium that participates in providing data that causes a machine to operate in a specific fashion. In an embodiment implemented using computer system 600, various machine-readable media are involved, for example, in providing instructions to processor 604 for execution. Such a medium may take many forms, including but not limited to, non-volatile media, volatile media, and transmission media. Non-volatile media includes, for example, optical or magnetic or magneto-optical disks, such as storage device 610. Volatile media includes dynamic memory, such as main memory 606. Transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise bus 602. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications.

Common forms of machine-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punchcards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave as described hereinafter, or any other medium from which a computer can read.

Various forms of machine-readable media may be involved in carrying one or more sequences of one or more instructions to processor 604 for execution. For example, the instructions may initially be carried on a magnetic disk of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem. A modem local to computer system 600 can receive the data on the telephone line and use an infra-red transmitter to convert the data to an infra-red signal. An infra-red detector can receive the data carried in the infra-red signal and appropriate circuitry can place the data on bus 602. Bus 602 carries the data to main memory 606, from which processor 604 retrieves and executes the instructions. The instructions received by main memory 606 may optionally be stored on storage device 610 either before or after execution by processor 604.

Computer system 600 also includes a communication interface 618 coupled to bus 602. Communication interface 618 provides a two-way data communication coupling to a network link 620 that is connected to a local network 622. For example, communication interface 618 may be a digital subscriber line (DSL), cable, or integrated services digital network (ISDN) card or a modem to provide a data communication connection to a corresponding type of telephone line. As another example, communication interface 618 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN. Wireless links may also be implemented. In any such implementation, communication interface 618 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.

Network link 620 typically provides data communication through one or more networks to other data devices. For example, network link 620 may provide a connection through local network 622 to a host computer 624 or to data equipment operated by an Internet Service Provider (ISP) 626. ISP 626 in turn provides data communication services through the world wide packet data communication network now commonly referred to as the “Internet” 628. Local network 622 and Internet 628 both use electrical, electromagnetic or optical signals that carry digital data streams. The signals through the various networks and the signals on network link 620 and through communication interface 618, which carry the digital data to and from computer system 600, are exemplary forms of carrier waves transporting the information.

Computer system 600 can send messages and receive data, including program code, through the network(s), network link 620 and communication interface 618. In the Internet example, a server 630 might transmit a requested code for an application program through Internet 628, ISP 626, local network 622 and communication interface 618.

The received code may be executed by processor 604 as it is received, and/or stored in storage device 610, or other non-volatile storage for later execution. In this manner, computer system 600 may obtain application code in the form of a carrier wave.

Extensions and Alternatives

In the foregoing specification, embodiments of the invention have been described with reference to numerous specific details that may vary from implementation to implementation. Thus, the sole and exclusive indicator of what is the invention, and is intended by the applicants to be the invention, is the set of claims that issue from this application, in the specific form in which such claims issue, including any subsequent correction. Any definitions expressly set forth herein for terms contained in such claims shall govern the meaning of such terms as used in the claims. Hence, no limitation, element, property, feature, advantage or attribute that is not expressly recited in a claim should limit the scope of such claim in any way. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

Alternative embodiments of the invention are described throughout the foregoing specification, and in locations that best facilitate understanding the context of the embodiments. Furthermore, the invention has been described with reference to specific embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention. Hence, embodiments of the invention can be generalized across many dimensions. For example, the techniques may be applied to other problem domains, such as for CPU scheduling in a realtime-scheduled system, or for scheduling the periodic backups of a network of machines that must be completed over and within a certain period of time. For another example, the techniques may be applied to other constraint types, such as for use in scheduling where the constraint is not time-related, e.g., for estimating how to store data coming from many inbound streams such as in a backup server. Here, the constraint could be that at least half of each data stream must be stored, and the scheduler could be used to determine which stream to sample next as data becomes available, subject to the availability of the disk resource. For yet another example, the techniques may be applied to other versions of the problem where more information is known, i.e., because the scheduler can handle situations where the scheduling constraints are dynamic, the scheduler can certainly handle situations where the constraints are static. Likewise, because the scheduler can handle situations where the workload is not known, the scheduler can also handle situations where the workload composition is known well in advance.

In addition, in this description certain process steps are set forth in a particular order, and alphabetic and alphanumeric labels may be used to identify certain steps. Unless specifically stated in the description, embodiments of the invention are not necessarily limited to any particular order of carrying out such steps. In particular, the labels are used merely for convenient identification of steps, and are not intended to specify or require a particular order of carrying out such steps.

1. A machine-implemented method comprising: downloading a first searchable item from a particular host; determining a quality metric for the first searchable item; determining a freshness target for the first searchable item based at least in part on the quality metric for the first searchable item; and using the freshness target to determine when to re-download the first searchable item.
2. The method of claim 1, further comprising using a number of searchable items on the particular host that are scheduled for downloading to determine when to re-download the first searchable item.
3. The method of claim 2, further comprising using the number of searchable items on the particular host that are scheduled for downloading relative to a maximum number of searchable items on the particular host that are to be scheduled for downloading to determine when to re-download the first searchable item.
4. The method of claim 1, further comprising: downloading a second searchable item on the particular host after downloading the first searchable item; determining a quality metric for the second searchable item; determining a second freshness target for the second searchable item based at least in part on the quality metric for the second searchable item; and using the second freshness target to determine when to re-download the second searchable item, wherein the second searchable item is re-downloaded before the first searchable item is re-downloaded.
5. The method of claim 4, wherein the quality metric for the second searchable item is different than the quality metric for the first searchable item.
6. The method of claim 4, further comprising using the number of searchable items on the particular host that are scheduled for downloading relative to a maximum number of searchable items on the particular host that are to be scheduled for downloading to determine when to re-download the second searchable item.
7. The method of claim 1, wherein said particular host is a first host, the method further comprising: downloading a second searchable item on a second host after downloading the first searchable item, wherein the second host is a different host from the first host; determining a quality metric for the second searchable item; determining a second freshness target for the second searchable item based at least in part on the quality metric for the second searchable item; and using the second freshness target to determine when to re-download the second searchable item; wherein the second searchable item is re-downloaded before the first searchable item is re-downloaded.
8. The method of claim 7, wherein the first and second searchable items are re-downloaded using shared resources used to crawl a plurality of hosts.
9. The method of claim 7, further comprising using a number of searchable items on the second host that are scheduled for downloading to determine when to re-download the second searchable item.
10. The method of claim 9, further comprising using the number of searchable items on the second host that are scheduled for downloading relative to a maximum number of searchable items on the second host that are to be scheduled for downloading to determine when to re-download the second searchable item.
11. The method of claim 1, further comprising using processing capabilities of the particular host to determine when to re-download the first searchable item.
12. The method of claim 1, further comprising: determining a politeness factor that specifies a delay between consecutive download requests submitted to the particular host; and using the politeness factor to determine when to re-download the first searchable item.
13. The method of claim 1, further comprising: re-downloading the first searchable item; and using the freshness target to determine when to again re-download the first searchable item.
14. The method of claim 13, further comprising using the processing capabilities of the particular host to determine when to again re-download the first searchable item.
15. The method of claim 13, further comprising using a politeness factor that specifies a delay between consecutive download requests submitted to said particular host to determine when to again re-download the first searchable item.
16. A machine-implemented method comprising: downloading a plurality of searchable items hosted by a particular host, reading one or more freshness targets that correspond to the plurality of searchable items, wherein a freshness target specifies how often a corresponding searchable item should be crawled, wherein a freshness target is based at least in part on the quality metric for one or more of the plurality of searchable items; reading a politeness factor that corresponds to the particular host, wherein the politeness factor specifies a rate for submitting consecutive download requests to the particular host; determining, based at least in part on the freshness targets and the politeness factor, whether the particular crawler system has enough processing capacity to crawl the plurality of searchable items in compliance with the freshness targets and the politeness factor; and if determined the particular crawler system does not have enough processing capacity to crawl the plurality of searchable items in compliance with the freshness targets and the politeness factor, then generating a message indicating that the particular crawler system does not have enough processing capacity.
17. The method of claim 16, further comprising: causing a graphical display of a number of searchable items queued for downloading per hour for a certain number of hours in the future.
18. The method of claim 16, further comprising: causing display of a current operational status of the particular crawler system, wherein the operational status includes a number of searchable items queued for downloading for each of one or more hosts.
19. The method of claim 18, wherein the operational status includes an estimated time per download for the searchable items for each of the one or more hosts.